REPORT LAYOUT: - Introduction - General Analysis and insights (try to find unique insights) - Analysis of factors affecting revenue - ML and prediction of select movies

Executive Summary

Project Framework: Definition and Methodology

In this report, we will explore the various factors that impact and influence the monetary success of a movie at the box office. Our investigation extends beyond mere fiscal considerations, encompassing a nuanced examination of critical determinants such as the expertise of the cast and crew. By scrutinizing these diverse components, the report aims to provide a comprehensive understanding of the factors that defines a movie’s monetary success at the box office.

The data was obtained through the use of our own web scraping algorithm and covers the top 75 grossing movies over the past 25 years.

Temporal Analysis

Over time, the average revenue demonstrates a distinct upward trend, with a notable observation regarding the rate of growth in Foreign revenue compared to Domestic revenue. The surge in global revenue is primarily driven by the rapid expansion of foreign revenue, highlighting the escalating growth and acceptance of Western films in international markets.

The onset of the Covid-19 Pandemic significantly impacted the film industry, evident in the graph. Productions were halted, and theaters closed, leading to a substantial loss of earning potential. The lockdown measures globally disrupted filming schedules, postponed releases, and the closure of theaters eliminated a crucial avenue for revenue. This had a ripple effect across the industry, affecting filmmakers, actors, crew members, distributors, and exhibitors. The industry’s vulnerability to external shocks became apparent, prompting the need for innovative adaptations to navigate the challenges such as online releases.

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

The impact of the month of film release is a fascinating observation. Notably, films hitting the screens in May and June consistently outperform those released in other months. Utilizing an analysis of variance (ANOVA) shows a significant disparity in average revenue across different release months. Several factors contribute to this phenomenon:

  1. Summer Blockbuster Season: May and June fall within the traditional summer movie season in numerous regions. Studios strategically unveil high-budget blockbuster films during this period, targeting a broad audience. The warmer weather and school vacations further boost movie attendance.

  2. Strategic Release Patterns: The film industry acknowledges this pattern, leading to a clustering effect. Recognizing the advantageous months, more popular and anticipated films tend to be strategically released during May and June. This intentional scheduling capitalizes on the observed heightened audience engagement during these months.

  3. Genre Preferences: Certain movie genres, such as action, adventure, and fantasy, are often associated with May and June releases as seen in the graph. These genres tend to draw larger audiences and generate higher revenue, contributing to the observed pattern.

## `summarise()` has grouped output by 'Month'. You can override using the
## `.groups` argument.
term df sumsq meansq statistic p.value
Month 11 8.114459e+18 7.376781e+17 10.76122 0
Residuals 1848 1.266797e+20 6.854964e+16 NA NA
month_seasonality <- release_month %>% 
  dplyr::select(-Genres, -Title) %>% 
  dplyr::arrange(`Earliest Release Date`) %>%
  dplyr::group_by(`Earliest Release Date`) %>% 
  dplyr::summarise(Revenue = mean(Worldwide))

ts <- as_tsibble(month_seasonality, index = `Earliest Release Date`) %>% 
  tsibble::group_by_key() %>%
  tsibble::index_by(freq = ~ yearmonth(.)) %>%
  dplyr::summarise(value = mean(Revenue), .groups = c("keep")) %>% 
  tsibble::fill_gaps()

ts %>% feasts::gg_subseries(value)

ts %>% 
  fabletools::features(value, feasts::feat_stl)
## # A tibble: 1 × 9
##   trend_strength seasonal_strength_year seasonal_peak_year seasonal_trough_year
##            <dbl>                  <dbl>              <dbl>                <dbl>
## 1          0.521                  0.582                  5                    8
## # ℹ 5 more variables: spikiness <dbl>, linearity <dbl>, curvature <dbl>,
## #   stl_e_acf1 <dbl>, stl_e_acf10 <dbl>
genre <- df %>% 
  dplyr::select(Title, Worldwide, Genres) %>% 
  tidyr::separate_rows(Genres, sep = ", ")

genre_plot <- genre %>% 
  ggplot(aes(x = Genres, y = Worldwide, fill = Genres)) +
  geom_boxplot() +
  coord_flip() +  # To make the plot horizontal for better readability
  labs(title = "Distribution of Revenue Across Genres",
       x = "Genres",
       y = "Revenue") +
  theme(legend.position = "none")
  
genre_plot

genre_summary <- genre %>%
  group_by(Genres) %>%
  summarise(Avg_Revenue = mean(Worldwide, na.rm = TRUE)) %>%
  arrange(desc(Avg_Revenue)) %>% 
  head(10) %>% 
  kbl(caption = "Top 10 Genres by Revenue",
      col.names = c("Genre", "Avg Revenue")) %>% 
  kable_classic(full_width = F, html_font = "Times New Roman")


genre_summary
Top 10 Genres by Revenue
Genre Avg Revenue
Sci-Fi 390186395
Adventure 376108863
Musical 336913540
Fantasy 329471252
Animation 327635090
Action 318624319
Family 312315794
Comedy 230154236
Thriller 222654160
Mystery 217913536
genre_distribution <- genre %>% 
  dplyr::group_by(Genres) %>% 
  dplyr::summarise(Movie_Count = n()) %>% 
  arrange(desc(Movie_Count)) %>% 
  ggplot(aes(x = Genres, y = Movie_Count, fill = Genres)) +
  geom_bar(stat = "identity") +
  labs(title = "Distribution of Movies Across Different Genres",
         x = "Genres",
         y = "Movie Count") +
  coord_flip() +
  theme(legend.position = "none")

genre_distribution

genres_per_movie <- df %>% 
  dplyr::select(Title, Worldwide, Genres) %>% 
  dplyr::mutate(Genre_Count = str_count(Genres, ",") + 1) %>% 
  dplyr::group_by(Genre_Count) %>% 
  dplyr::summarise(Avg_rev = mean(Worldwide),
                   Movie_Count = n()) %>% 
  ggplot(aes(x = factor(Genre_Count), y = Avg_rev, fill = factor(Genre_Count))) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = Movie_Count), vjust = -0.5) +  # Display movie count on top of bars
  labs(title = "Avg Revenue by Number of Genres and Number of Movies",
       x = "Number of Genres",
       y = "Avg Revenue") +
  theme_minimal() +
  theme(legend.position = "none")

genres_per_movie

cast <- df %>% 
  dplyr::select(Title, Worldwide, Cast) %>% 
  tidyr::separate_rows(Cast, sep = ", ") %>% 
  dplyr::group_by(Cast) %>% 
  dplyr::summarise(AvgRev = mean(Worldwide),
                   Count = n()) %>% 
  dplyr::mutate(Movies_Acted = case_when(
    Count >= 5 & Count <= 10 ~ '5-10',
    Count > 10 & Count <= 15 ~ '10-15',
    Count > 15 & Count <= 20 ~ '15-20',
    Count > 20 ~ '20+',
    TRUE ~ 'Less than 5'
  )) %>% 
  dplyr::group_by(Movies_Acted) %>% 
  dplyr::summarise(AvgRev = mean(AvgRev),
                   Count = n()) %>% 
  dplyr::arrange(desc(AvgRev))
cast
## # A tibble: 5 × 3
##   Movies_Acted     AvgRev Count
##   <chr>             <dbl> <int>
## 1 10-15        340892604.    52
## 2 20+          325170485.     9
## 3 5-10         297994310.   291
## 4 15-20        291142359.    31
## 5 Less than 5  200239184.  2763
# Changing preference for newer faces or different types of story telling. Still like regulars 
range_order <- c("Less than 5", "5-10", "10-15", "15-20", "20+")
cast$Movies_Acted <- factor(cast$Movies_Acted, levels = range_order)
ggplot(cast, aes(x = Movies_Acted, y = AvgRev, fill = factor(Count))) +
  geom_bar(stat = "identity", position = "dodge", color = "black") +
  scale_fill_viridis_d() +
  labs(title = "Average Revenue by Number of Movies Acted",
       x = "Movies_Acted",
       y = "Average Revenue",
       fill = "Count") +
  theme_minimal()

star <- df %>% 
  dplyr::select(Worldwide, Star) %>% 
  dplyr::group_by(Star) %>% 
  dplyr::summarise(AvgRev = mean(Worldwide),
                   Count = n()) %>% 
  dplyr::filter(Count >= 5) %>% 
  dplyr::arrange(desc(AvgRev))
star
## # A tibble: 93 × 3
##    Star                   AvgRev Count
##    <chr>                   <dbl> <int>
##  1 Robert Downey Jr. 1065872463.    11
##  2 Chris Pratt        942798923.     9
##  3 Tom Holland        894681963.     5
##  4 Daniel Radcliffe   873331103.     9
##  5 Elijah Wood        700348692.     5
##  6 Daniel Craig       584913297.     8
##  7 Jing Wu            579671044.     6
##  8 Tobey Maguire      548438356.     5
##  9 Kristen Stewart    542133565.     7
## 10 Chris Hemsworth    521315834      6
## # ℹ 83 more rows
star_plot <- star %>%  
  ggplot(aes(x = Count, y = AvgRev, size = Count, color = Count,
             text = paste("Star:", Star, "<br>Number of Movies:", Count, "<br>Average Revenue:", scales::dollar(AvgRev)))) +
  geom_point() +
  labs(title = "Movie Stars and Avg Revenue",
       x = "Number of Movies",
       y = "Average Revenue",
       size = "Number of Movies")
plotly::ggplotly(star_plot, tooltip = "text")

MACHINE LEARNING

library(dplyr)
library(parsnip)
library(ggplot2)
# Split data
set.seed(123)  # Set a seed for reproducibility
#df_shuffled <- df[sample(nrow(df)), ] # Shuffle data to elimate some bias hopefully/
df_split <- df %>% rsample::initial_split(prop = 0.80)
df_train <- rsample::training(df_split)
df_test  <- rsample::testing(df_split)

# Worldwide as dependent variable
recipe_pipeline <- recipes::recipe(Worldwide ~ Budget + Distributor + `Release Month` + MPAA + `Run Time (Mins)` + `First Genre`+`count_genres`, data = df_train) %>%
  # step_rm(date) %>%
  recipes::prep()

train_baked <- recipes::bake(recipe_pipeline, df_train)

recipe_pipeline <- recipes::recipe(Worldwide ~ Budget + Distributor + `Release Month` + MPAA + `Run Time (Mins)` + `First Genre`+`count_genres`, data = df_test) %>%
  # step_rm(date) %>%
  recipes::prep()

test_baked <- recipes::bake(recipe_pipeline, df_test)  # Corrected to use test_baked


common_levels_distributor <- intersect(levels(train_baked$Distributor), levels(test_baked$Distributor))
train_baked$Distributor <- factor(train_baked$Distributor, levels = common_levels_distributor)
test_baked$Distributor <- factor(test_baked$Distributor, levels = common_levels_distributor)



# Modeling
model <- parsnip::decision_tree(mode = "regression") %>%
  parsnip::set_engine("rpart") %>%
  parsnip::fit(Worldwide ~ Budget + Distributor + `Release Month` + MPAA + `Run Time (Mins)` + `First Genre`+`count_genres`, data = train_baked)
model
## parsnip model object
## 
## n= 1492 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 1492 1.122213e+20  261062300  
##    2) Budget< 1.445e+08 1335 5.514783e+19  215972100  
##      4) Budget< 8.9e+07 1083 2.741099e+19  179967600 *
##      5) Budget>=8.9e+07 252 2.029945e+19  370705400  
##       10) First Genre=Biography,Comedy,Crime,Documentary,Drama,Fantasy,Horror 41 7.609659e+17  218163700 *
##       11) First Genre=Action,Adventure,Mystery 211 1.839907e+19  400346200  
##         22) Run Time (Mins)< 139 151 5.717508e+18  337193600 *
##         23) Run Time (Mins)>=139 60 1.056373e+19  559280300  
##           46) MPAA=R 14 3.150352e+17  270567100 *
##           47) MPAA=PG,PG-13 46 8.726558e+18  647149600  
##             94) Release Month=Aug,Jan,Jun,Sep 13 1.037251e+18  376371400 *
##             95) Release Month=Apr,Dec,Feb,Jul,Mar,May,Nov,Oct 33 6.360646e+18  753819800 *
##    3) Budget>=1.445e+08 157 3.127963e+19  644473100  
##      6) Budget< 2.185e+08 133 1.527063e+19  562482600  
##       12) Run Time (Mins)< 120.5 66 3.358576e+18  448968700 *
##       13) Run Time (Mins)>=120.5 67 1.022388e+19  674302200  
##         26) Distributor=Lions Gate Films,Paramount Pictures,Sony Pictures Entertainment (SPE),Twentieth Century Fox,Universal Pictures,Warner Bros. 55 5.861216e+18  594204700 *
##         27) Distributor=20th Century Studios,DreamWorks,Walt Disney Studios Motion Pictures 12 2.392533e+18 1041416000 *
##      7) Budget>=2.185e+08 24 1.016019e+19 1098837000  
##       14) Release Month=Jul,Jun,Mar,May,Nov,Oct 17 1.854293e+18  799961800 *
##       15) Release Month=Apr,Dec 7 3.099440e+18 1824678000 *
test_predictions <- predict(model, new_data = test_baked)
test_predictions
## # A tibble: 373 × 1
##         .pred
##         <dbl>
##  1 218163671.
##  2 179967611.
##  3 179967611.
##  4 179967611.
##  5 179967611.
##  6 179967611.
##  7 179967611.
##  8 179967611.
##  9 179967611.
## 10 179967611.
## # ℹ 363 more rows
# Plot tree
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.2
## Loading required package: rpart
## 
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
## 
##     prune
rpart.plot(
  model$fit,
  roundint = F,
  cex = 1,
  fallen.leaves = F,
  extra = "auto",
  main = "Regression Tree"
)

# Results
res <- model %>% predict(new_data = test_baked) %>%
  bind_cols(test_baked %>% dplyr::select(Worldwide))
res %>% yardstick::metrics(truth = Worldwide, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator     .estimate
##   <chr>   <chr>              <dbl>
## 1 rmse    standard   216512197.   
## 2 rsq     standard           0.283
## 3 mae     standard   132896811.
res %>% ggplot(aes(x = .pred, y = Worldwide)) + geom_point() +
  labs(title = "Prediction vs Actual",
       subtitle = "Decision Tree - Regression")

# Assuming 'model' is your decision tree model and 'df_test' is your testing data
library(parsnip)
library(yardstick)
library(vip)
## Warning: package 'vip' was built under R version 4.3.2
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
# Assess variable importance
importance <- model %>% vip()

# Assess variable relationships
plot(importance)